This document describes the Apple® Speech Manager, which provides a standardized method for Macintosh® applications to generate synthesized speech.
The document provides an overview of the Speech Manager followed by general information about generating speech from text. The necessary information and calls needed by all text-to-speech applications are given next, followed by a simple example of speech generation. More advanced calls and special-purpose routines are described last.
Speech Manager Overview
A complete system for speech synthesis consists of the elements shown in Figure 1-1.
Figure 1-1 Speech synthesis components
An application calls routines in the Speech Manager to convert character strings into speech and to adjust various parameters that affect the quality or character of the spoken output. The Speech Manager is responsible for dispatching these requests to a speech synthesizer. The speech synthesizer converts the text into sound and creates the actual audio output.
The Apple-supplied voices, pronunciation dictionaries, and speech synthesizer may reside in a single file or in separate files. These files are clearly identifiable as Speech Manager–related files and are installed and removed by being dragged into or out of the System Folder. Additional voices can be provided by bundling the resources in the resource forks of specific applications. These resources are considered private to that particular application. It is up to the individual developers to decide whether the voice resources they provide are usable on a systemwide basis or only from within their applications.
In the first release of the Speech Manager, pronunciation dictionaries are managed entirely by the application. The application is free to store dictionaries in either the resource or the data fork of a file. The application is responsible for loading the individual dictionaries into RAM and then passing a handle to the dictionary data to the Speech Manager.
Applications that use the Speech Manager must provide their own human interface for selecting voices and/or controlling other speech characteristics. If voices are provided in separate files, the speech synthesizer developer is responsible for providing a method for installing these resources into the System Folder or Extensions folder. The computer must be rebooted after speech synthesizers are added to or removed from the System Folder for the desired changes to be recognized.
Speech Manager Concepts
On a simple level, speech synthesis from text input is a two-stage process. First, plain-language English text is converted into phonemic representations for the individual words. Phonemes stand for specific sounds; for a complete explanation, see “Summary of Phonemes and Prosodic Controls,” later in this document. The resulting sequence of phonemes is converted into audible sounds by mapping the individual phonemes to a series of waveforms, which are sent to the sound hardware to be played.
In reality, each stage is more complicated than this description suggests. For example, during the text-to-phoneme conversion stage, number strings, abbreviations, and special symbols must be detected and converted into appropriate words before being converted into phonemes. When a sentence such as “He earned over $2,000,000 in 1990” is spoken, it would normally be preferable to say “He earned over two million dollars in nineteen-ninety” rather than “He earned over dollar-sign, two, comma, zero, zero, zero, comma, zero, zero, zero, in one, nine, nine, zero.” To produce the desired spoken output automatically, knowledge of these sorts of constructions is built into the synthesizer.
The phoneme-to-sound conversion stage is also complex. Phonemes by themselves are often not sufficient to describe the way a word should be pronounced. For example, the word “object” is pronounced differently depending on whether it is used as a noun or a verb. (When it is used as a noun, the stress is placed on the first syllable. As a verb, the stress is placed on the second syllable.) In addition to stress information, phonemes must often be augmented with pitch, duration, and other information to produce intelligible, natural-sounding speech.
The speech synthesizer has many built-in rules for automatically converting text into the complex phonemic representation described above. However, there will always be words and phrases that are not pronounced the way you want. The Speech Manager allows you to provide raw phonemic information directly in order to enable very precise control over the spoken output.
By default, speech synthesizers expect input in normal language text. However, using the input mode controls of the Speech Manager, you can tell the synthesizer to process input text in raw phonemic form. By using the embedded commands described in the next section, you can even mix normal language text with phonemic text within a single string or text buffer.
See “Summary of Phonemes and Prosodic Controls,” later in this document, for a listing of the phonemic character set and each character’s interpretation.
Using the Speech Manager
This section describes the routines used to add speech synthesis features to an application. It is organized into three sections: “Getting Started” (Easy), “Essential Calls—Simple and Useful” (Intermediate), and “Advanced Routines.”
Getting Started
If you’re just getting started with text-to-speech conversion using the Speech Manager, the following routines will get you up and running with minimal effort. If you’re developing an application that does not need to choose voices, use more than one channel of speech, or exercise real-time control over the synthesized speech, these may be the only routines you need.
Determining If the Speech Manager Is Available
You can find out if the Speech Manager is available with a single call to the Gestalt Manager.
Use the Gestalt toolbox routine and the selector gestaltSpeechAttr to determine whether or not the Speech Manager is available, as shown in Listing 1-1. If Gestalt returns noErr, then the parameter argument will contain a 32-bit value indicating one or more attributes of the installed Speech Manager. If the Speech Manager exists, the bit specified by gestaltSpeechMgrPresent is set.
Listing 1-1 Determining if the Speech Manager is available
Boolean SpeechAvailable (void) {
    OSErr err;
    long result;

    err = Gestalt(gestaltSpeechAttr, &result);
    if ((err != noErr) || !(result & (1 << gestaltSpeechMgrPresent)))
        return FALSE;
    else
        return TRUE;
}
Which Version of the Speech Manager Is Running?
Once you have determined that the Speech Manager is installed, you can see which version of the Speech Manager is running by calling SpeechManagerVersion.
SpeechManagerVersion
Returns the version of the Speech Manager installed in the system.
pascal NumVersion SpeechManagerVersion (void);
DESCRIPTION
SpeechManagerVersion returns the version of the Speech Manager installed in the system. This call should be used to determine the compatibility of your program with the currently installed Speech Manager.
RESULT CODES
None
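For example, an application that depends on features of a later Speech Manager release might compare the fields of the returned NumVersion value, as in the following sketch; the required version numbers are placeholders for illustration:

Boolean SpeechVersionAtLeast (unsigned char major, unsigned char minor) {
    NumVersion v = SpeechManagerVersion();

    if (v.majorRev != major)
        return (v.majorRev > major);
    // The high-order nibble of minorAndBugRev holds the minor revision
    return ((v.minorAndBugRev >> 4) >= minor);
}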
Making Some Noise
The most basic operation of the Speech Manager is accomplished by using the SpeakString call, which passes a text string to the Speech Manager to be spoken.
SpeakString
The SpeakString function passes a text string to the Speech Manager to be spoken.
pascal OSErr SpeakString (StringPtr myString);
Field descriptions
myString Text string to be spoken
DESCRIPTION
SpeakString attempts to speak the Pascal-style text string contained in myString. Speech is produced asynchronously using the default system voice. When an application calls this function, the Speech Manager makes a copy of the passed string and creates any structures required to speak it. As soon as speaking has begun, control is returned to the application. The synthesized speech is generated transparently to the application so that normal processing can continue while the text is being spoken. No further interaction with the Speech Manager is required at this point, and the application is free to release, purge, or otherwise dispose of the original string.
If SpeakString is called while a prior string is still being spoken, the audio currently being synthesized is interrupted immediately. Conversion of the new text into speech is then initiated. If an empty (zero length) string or a null string pointer is passed to SpeakString, it stops the synthesis of any prior string but does not generate any additional speech.
As with all Speech Manager routines that expect text arguments, the text may contain embedded speech control commands.
RESULT CODES
noErr                 0    No error
memFullErr         –108    Not enough memory to speak
synthOpenFailed    –241    Could not open another speech synthesizer channel
Determining If Speaking Is Complete
Once an application starts a speech process with SpeakString, the next thing it will probably need to know is whether the string has been completely spoken. It can use SpeechBusy to determine whether or not the system is still speaking.
SpeechBusy
The SpeechBusy routine is useful when you want to ensure that an earlier speech request has been completed before having the system speak again.
pascal short SpeechBusy (void);
DESCRIPTION
SpeechBusy returns the number of channels of speech that are currently synthesizing speech in the application. If you use just SpeakString to initiate speech, SpeechBusy will always return 1 as long as speech is being produced. When SpeechBusy returns 0, all initiated speech has finished.
RESULT CODES
None
A Simple Example
The example shown in Listing 1-2 demonstrates how to use the routines introduced in this section. It first makes sure the Speech Manager is available. Then it starts speaking a string (hard-coded in this example, but more commonly loaded from a resource) and loops, doing some screen drawing, until the string is completely spoken. This example uses the SpeechAvailable routine shown in Listing 1-1.
Listing 1-2 Elementary Speech Manager calls
OSErr err;

if (SpeechAvailable()) {
    err = SpeakString("\pThe cat sat on the mat.");
    if (err == noErr)
        while (SpeechBusy() > 0)
            CoolAnimationRoutine();
    else
        NotSoCoolAlertRoutine(err);
}
Essential Calls—Simple and Useful
While the routines presented in the last section are simple to use, their applicability is limited to a few basic speech scenarios. This section describes additional routines that let you work with different voices and adjust some basic characteristics of the synthesized speech.
Working With Voices
When describing a person’s voice, we talk about the particular set of characteristics that help us to distinguish that person’s voice from another. For example, the rate at which one speaks (slow or fast) and the average pitch (high or low) characterize a particular speaker on a crude level. In the context of the Speech Manager, a voice is the set of parameters that specify a particular quality of synthesized speech. This portion of the Speech Manager is used to determine which voices are available and to select particular voices.
Every voice has a unique ID associated with it, which is the primary way an application refers to the voice. Within the Speech Manager, a voice ID is represented by a VoiceSpec structure.
The Speech Manager provides two routines to count and step through the list of currently available voices. CountVoices is used to compute how many voices are available with the current system. GetIndVoice uses an index, starting at 1, to step through the currently installed voices, returning information about each voice in turn.
Use the GetIndVoice routine to step through the list of available voices. It will fill a VoiceSpec record that can be used to obtain descriptive information about the voice or to speak using that voice.
Any application that wishes to use multiple voices will probably need additional information about the available voices beyond the VoiceSpec structure, such as the name of the voice and perhaps what script and language each voice supports. This information might be presented to the user in a “voice picker” dialog box or voice menu, or it might be used internally by an application trying to find a voice that meets certain criteria. Applications can use the GetVoiceDescription routine for these purposes.
MakeVoiceSpec
To maximize compatibility with future versions of the Speech Manager, you should always use MakeVoiceSpec instead of setting the fields of the VoiceSpec structure directly.
pascal OSErr MakeVoiceSpec (OSType creator, OSType id, VoiceSpec *voice);

typedef struct VoiceSpec {
    OSType creator;  // determines which synthesizer is required
    OSType id;       // voice ID on the specified synthesizer
} VoiceSpec;

Field descriptions
creator    The synthesizer required by your application
id         Identification number for this voice
*voice     Pointer to the VoiceSpec structure
DESCRIPTION
Most voice management routines expect to be passed a pointer to a VoiceSpec structure. MakeVoiceSpec is a utility routine provided to facilitate the creation of VoiceSpec records. On return, the passed VoiceSpec structure contains the appropriate values.
Voices are stored in resources of type 'ttsv' in the resource fork of Macintosh files. The Speech Manager uses the same search method as the Resource Manager, looking for voice resources in three different locations when attempting to resolve VoiceSpec references. It first looks in the application’s resource file chain. If the specified voice is not found in an open file, it then looks in the System Folder and the Extensions folder (or in just the System Folder under System 6) for files of type 'ttsv' (single-voice files) or 'ttsb' (multivoice files) and in text-to-speech synthesizer component files (file type 'INIT' or 'thng'). Voices stored in the System Folder or Extensions folder are normally available to all applications. Voices stored in the resource fork of an application file are private to the application.
RESULT CODES
noErr    0    No error
While the determination of specific voice ID values is mostly left to synthesizer developers, the voice creator values are specified by Apple (they would ordinarily correspond to a developer’s currently assigned creator ID). For both the creator and id fields Apple further reserves the set of OSType values specified entirely by space characters and lowercase letters. Apple is establishing a standard suite of voice ID values that developers can count upon being available with all speech synthesizers.
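A call to MakeVoiceSpec might look like the following sketch; the creator code 'xmpl' and voice ID 2 are hypothetical values used only for illustration:

VoiceSpec spec;
OSErr err;

err = MakeVoiceSpec('xmpl', 2, &spec);  // hypothetical creator and voice ID
if (err == noErr) {
    // spec can now be passed to GetVoiceDescription, NewSpeechChannel, etc.
}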
CountVoices
The CountVoices routine returns the number of voices available.
pascal OSErr CountVoices (short *voiceCount);
Field descriptions
voiceCount Number of voices available to the application
DESCRIPTION
Each time CountVoices is called, the Speech Manager searches for new voices. This algorithm supports dynamic installation of voices by applications or users. On return, the voiceCount parameter contains the number of voices available.
RESULT CODES
noErr    0    No error
GetIndVoice
The GetIndVoice routine returns information about a specific voice.

pascal OSErr GetIndVoice (short index, VoiceSpec *voice);

Field descriptions
index      Index of the voice, starting at 1
*voice     Pointer to the VoiceSpec structure

DESCRIPTION
As with all other index-based routines in the Macintosh Toolbox, an index value of 1 causes GetIndVoice to return information for the first voice. The order that voices are returned is not presently defined and should not be assumed. Speech Manager behavior when voice files or resources are added, removed, or modified is also presently undefined. However, calling CountVoices or GetIndVoice with an index of 1 will force the Speech Manager to update its list of available voices. GetIndVoice will return a voiceNotFound error if the passed index value exceeds the number of available voices.
RESULT CODES
noErr              0    No error
voiceNotFound   –244    Voice resource not found
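The following sketch combines CountVoices, GetIndVoice, and GetVoiceDescription (described next) to step through every installed voice; AddVoiceNameToMenu is a hypothetical application routine standing in for whatever your application does with each name:

void BuildVoiceMenu (void) {
    OSErr err;
    short voiceCount, i;
    VoiceSpec voice;
    VoiceDescription vd;

    if (CountVoices(&voiceCount) != noErr)
        return;
    for (i = 1; i <= voiceCount; i++) {
        if (GetIndVoice(i, &voice) != noErr)
            break;
        err = GetVoiceDescription(&voice, &vd, sizeof(VoiceDescription));
        if (err == noErr)
            AddVoiceNameToMenu(vd.name);  // hypothetical application routine
    }
}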
GetVoiceDescription
The GetVoiceDescription routine returns information about a voice beyond that provided by GetIndVoice.

pascal OSErr GetVoiceDescription (VoiceSpec *voice, VoiceDescription *info, long infoLength);
enum {kNeuter = 0, kMale, kFemale};  // returned in gender field below

typedef struct VoiceDescription {
    long      length;       // size of structure
    VoiceSpec voice;        // synth and ID info for voice
    long      version;      // version code for voice
    Str63     name;         // name of voice
    Str255    comment;      // additional text info about voice
    short     gender;       // neuter, male, or female
    short     age;          // approximate age in years
    short     script;       // script code of text voice can process
    short     language;     // language code of voice output speech
    short     region;       // region code of voice output speech
    long      reserved[4];  // always zero - reserved
} VoiceDescription;
Field descriptions
*voice Pointer to the VoiceSpec structure
*info Pointer to structure containing parameters for the specified voice
infoLength Length in bytes of info structure
DESCRIPTION
The Speech Manager fills out the passed VoiceDescription fields with the correct information for the specified voice. If a null VoiceSpec pointer is passed, the Speech Manager returns information for the system default voice. If the VoiceDescription pointer is null, the Speech Manager simply verifies that the specified VoiceSpec refers to an available voice. If VoiceSpec does not refer to a known voice, GetVoiceDescription returns a voiceNotFound error. Listing 1-3 illustrates how to call GetVoiceDescription.
To maximize compatibility with future versions of the Speech Manager, the application must pass the size of the VoiceDescription structure. Having the application do this ensures that the Speech Manager will never write more data into the passed structure than will fit even if additional information fields are defined in the future. On returning from GetVoiceDescription, the length field is set to reflect the length of data actually written by this routine.
Listing 1-3 Getting information about a voice
OSErr GetVoiceGender (VoiceSpec *voicePtr, short *gender) {
    OSErr err;
    VoiceDescription vd;

    err = GetVoiceDescription(voicePtr, &vd, sizeof(VoiceDescription));
    if (err == noErr) {
        if (vd.length > offsetof(VoiceDescription, gender))
            *gender = vd.gender;
        else
            err = badStructLen;  // application-defined error code
    }
    return err;
}
RESULT CODES
noErr              0    No error
paramErr         –50    Parameter error
memFullErr      –108    Not enough memory to load voice into memory
voiceNotFound   –244    Voice resource not found
Managing Connections to Speech Synthesizers
Using the routines described earlier in this document, an application can select the voice with which to speak. The next step is to associate the selected voice with the proper speech synthesizer. This is accomplished by creating a new speech channel with the NewSpeechChannel routine. A speech channel is a private communication connection to the speech synthesizer, much as a file reference number is a communication channel to an open file in the Macintosh file system.
The DisposeSpeechChannel routine closes a speech channel when the application is finished with it and releases any resources that have been allocated to support the speech synthesizer and are no longer needed.
NewSpeechChannel
The NewSpeechChannel routine creates a new speech channel.
pascal OSErr NewSpeechChannel (VoiceSpec *voice, SpeechChannel *chan);
Field descriptions
*voice Pointer to the VoiceSpec structure
*chan Pointer to the new channel
DESCRIPTION
The Speech Manager automatically locates and opens a connection to the proper synthesizer for a specified voice and sets up a channel at the location pointed to by *chan so that it is ready to speak with that voice. If a null VoiceSpec pointer is passed to NewSpeechChannel, the Speech Manager uses the current system default voice.
There is no predefined limit to the number of speech channels an application may create. However, system constraints on available RAM, processor loading, and number of available sound channels may limit the number of speech channels actually possible.
RESULT CODES
noErr                 0    No error
memFullErr         –108    Not enough memory to open speech channel
synthOpenFailed    –241    Could not open another speech synthesizer channel
voiceNotFound      –244    Voice resource not found
DisposeSpeechChannel
The DisposeSpeechChannel routine disposes of an existing speech channel.

pascal OSErr DisposeSpeechChannel (SpeechChannel chan);

Field descriptions
chan       Specific speech channel

DESCRIPTION
DisposeSpeechChannel disposes of an existing speech channel and releases the resources allocated to support it. Any speech channels that have not been explicitly disposed of by the application are released automatically by the Speech Manager when the application quits.
All the remaining routines in this section require a valid speech channel to work properly. Once the application has successfully created a speech channel, it can start to speak. You use the SpeakText routine to begin speaking on a speech channel.
At any time during the speaking process, the application can stop the synthesizer’s speech. The StopSpeech routine will immediately abort any speech being produced on the specified speech channel and force the channel back into an idle state.
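As a minimal sketch of this life cycle, the routine below opens a channel with the system default voice (by passing a null VoiceSpec pointer), speaks a buffer, waits for completion, and disposes of the channel. It assumes a single-channel application, since SpeechBusy counts all of the application's channels:

OSErr SpeakWithDefaultVoice (Ptr textBuf, long byteLength) {
    SpeechChannel chan;
    OSErr err;

    err = NewSpeechChannel(NULL, &chan);  // null VoiceSpec = default voice
    if (err == noErr) {
        err = SpeakText(chan, textBuf, byteLength);
        while ((err == noErr) && (SpeechBusy() > 0)) {
            ;  // a real application would do useful work or yield here
        }
        DisposeSpeechChannel(chan);
    }
    return err;
}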
SpeakText
The SpeakText routine converts a designated text into speech.
pascal OSErr SpeakText (SpeechChannel chan, Ptr textBuf, long byteLength);
Field descriptions
chan Specific speech channel
textBuf Buffer of text
byteLength Length of textBuf
DESCRIPTION
In addition to a valid speech channel, SpeakText expects a pointer to the text to be spoken and the length in bytes of the text buffer. SpeakText will convert the given text stream into speech using the voice and control settings for that speech channel. The speech is generated asynchronously. This means that control is returned to your application before the speech has finished (probably even before it has begun). The maximum length of text buffer that can be spoken is limited only by the available RAM. However, it’s generally not very friendly to force the user to listen to long uninterrupted text unless the user requests it.
If SpeakText is called while it is currently busy speaking the contents of a prior text buffer, it will immediately stop speaking from the prior buffer and will begin speaking from the new text buffer as soon as possible. As with SpeakString, described on page 5, if an empty (zero length) string or a null text buffer pointer is passed to SpeakText, this will have the effect of stopping the synthesis of any prior text but will not generate any additional speech.
WARNING
With SpeakText, unlike with SpeakString, the text buffer must be locked in memory and must not move during the entire time that it is being converted into speech. This buffer is read at interrupt time, and very undesirable effects will happen if it moves or is purged from memory.
StopSpeech
The StopSpeech routine terminates speech delivery on a specified channel.
pascal OSErr StopSpeech (SpeechChannel chan);
Field descriptions
chan Specific speech channel
DESCRIPTION
After returning from StopSpeech, the application can safely release any text buffer that the speech synthesizer has been using. The SpeechBusy routine, described on page 6, can be used to determine if the text has been completely spoken. (In an environment with multiple speech channels, you may need to use the more advanced status routine GetSpeechInfo, described on page 25, to determine if a specific channel is still speaking.) StopSpeech can be called for an already idle channel without ill effect.
The Speech Manager provides several methods of adjusting the variables that can affect the way speech is synthesized. Although most applications probably do not need to use these advanced features, two of the speech variables, speaking rate and speaking pitch, are useful enough that a very simple way of adjusting these parameters on a channel-by-channel basis is provided. Routines are supplied that enable an application to both set and get these parameters. However, the audible effects of changing the rate and pitch of speech vary from synthesizer to synthesizer; you should test the actual results on all synthesizers with which your application may work.
Speaking rates are specified in terms of words per minute (WPM). While this unit of measurement is difficult to define in any precise way, it is generally easy to understand and use. The range of supported rates is not predefined by the Speech Manager. Each speech synthesizer provides its own range of speaking rates. Furthermore, any specific rate value will correspond to slightly different rates with different synthesizers.
Speaking pitches are defined on a musical scale that corresponds to the keys on a standard piano keyboard. By convention, pitches are represented as fixed-point values in the range from 0.000 through 100.000, where 60.000 corresponds to middle C (261.625 Hz) on a conventional piano. Pitches are represented on a logarithmic scale. On this scale, a change of +12 units corresponds to doubling the frequency, while a change of –12 units corresponds to halving the frequency. For a further discussion of pitch values, see “Getting Information About a Speech Channel,” later in this document.
Typical voice frequencies might range from around 90 Hertz for a low-pitched male voice to perhaps 300 Hertz for a high-pitched child’s voice. These frequencies correspond to pitch values of 41.526 and 53.526, respectively.
Changes in speech rate and pitch are effective immediately (as soon as the synthesizer can respond), even if they occur in the middle of a word.
SetSpeechRate
The SetSpeechRate routine sets the speaking rate on a designated speech channel.

pascal OSErr SetSpeechRate (SpeechChannel chan, Fixed rate);

Field descriptions
chan       Specific speech channel
rate       Speaking rate in words per minute

DESCRIPTION
The SetSpeechRate routine is used to adjust the speaking rate on a speech channel. The rate parameter is specified as a fixed-point, words-per-minute value. As a general rule of thumb, “normal” speaking rates range from around 150 WPM to around 180 WPM. It is important when working with speaking rates, however, to keep in mind that users will differ greatly in their ability to understand synthesized speech at a particular rate, based upon their level of experience listening to the voice and their ability to anticipate the types of utterances they will encounter.
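Because the rate is a fixed-point value with the integer part in the high-order 16 bits, a rate of 180 WPM can be expressed by shifting, as in this sketch:

OSErr SetConversationalRate (SpeechChannel chan) {
    Fixed rate = 180L << 16;  // 180 words per minute as a Fixed value

    return SetSpeechRate(chan, rate);
}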
The code fragment in Listing 1-4 illustrates many of the routines introduced in this section. The example steps through the list of available voices to find the first female voice. Then it creates a new speech channel and begins speaking. While the voice is speaking, the pitch of the voice is continually adjusted around the original pitch. If the mouse button is pressed while the voice is speaking, the code halts the speech and exits. This example uses the SpeechAvailable and GetVoiceGender routines shown earlier in Listing 1-1 and Listing 1-3.
Listing 1-4 Putting it all together
OSErr err;
Str255 myStr = "\pThe bat sat on my hat.";
VoiceSpec voice;
VoiceDescription vd;
Boolean gotVoice = FALSE;
short voiceCount, gender, i;
SpeechChannel chan;
Fixed origPitch, newPitch;

if (myStr[0] && SpeechAvailable()) {
    err = CountVoices(&voiceCount);  // count the available voices
    i = 1;
    while ((i <= voiceCount) &&
           ((err = GetIndVoice(i++, &voice)) == noErr)) {
        err = GetVoiceGender(&voice, &gender);
        if ((err == noErr) && (gender == kFemale)) {
            gotVoice = TRUE;
            break;
        }
    }
    if (gotVoice) {
        err = NewSpeechChannel(&voice, &chan);
        if (err == noErr) {
            err = GetSpeechPitch(chan, &origPitch);  // current pitch
            if (err == noErr)
                err = SpeakText(chan, &myStr[1], myStr[0]);
            i = 0;
            if (err == noErr)
                while (SpeechBusy() > 0) {
                    CoolAnimationRoutine();
                    newPitch = (i - 4) << 16;  // fixed-point pitch offset
                    newPitch += origPitch;
                    i = (i + 1) & 7;  // steps from 0 to 7 repeatedly
                    err = SetSpeechPitch(chan, newPitch);
                    if ((err != noErr) || Button()) {
                        err = StopSpeech(chan);
                        break;
                    }
                }
            err = DisposeSpeechChannel(chan);
        }
    }
    if (err != noErr)
        NotSoCoolAlertRoutine(err);
}
Advanced Routines
This section describes several advanced or rarely used Speech Manager routines. You can use them to improve the quality of your application’s speech.
Advanced Speech Controls
The StopSpeech routine, described in “Starting and Stopping Speech,” earlier in this document, provides a simple way to interrupt any speech output instantly. In some situations it is preferable to be able to stop speech production at the next natural boundary, such as the next word or the end of the current sentence. StopSpeechAt provides that capability.
Similarly, the PauseSpeechAt routine causes speech to pause at a specified point in the text being spoken; the ContinueSpeech routine resumes speech after it has paused.
In addition to SpeakString and SpeakText, described earlier in this document, the Speech Manager provides a third, more general routine. SpeakBuffer is the low-level speech routine upon which the other two are built. SpeakBuffer provides greater control through the use of an additional flags parameter.
The SpeechBusySystemWide routine tells you if any speech is currently being synthesized in your application or elsewhere on the computer.
StopSpeechAt
The StopSpeechAt routine halts speech at a specific point in the text being spoken.
pascal OSErr StopSpeechAt (SpeechChannel chan, long whereToStop);
enum {
    kImmediate     = 0,
    kEndOfWord     = 1,
    kEndOfSentence = 2
};
Field descriptions
chan Specific speech channel
whereToStop Location in text at which speech is to stop
DESCRIPTION
StopSpeechAt is used to halt the production of speech at a specified point in the text. The whereToStop argument should be set to one of the following constants:
■ The kImmediate constant stops speech output immediately.
■ The kEndOfWord constant lets speech continue until the current word has been spoken.
■ The kEndOfSentence constant lets speech continue until the end of the current sentence has been reached.
This routine returns immediately, although speech output continues until the specified point has been reached.
WARNING
You must not release the memory associated with the current text buffer until the channel status indicates that the speech channel output is no longer busy.
If the end of the input text buffer is reached before the specified stopping point, the speech synthesizer will stop at this point. Once the stopping point has been reached, the application is free to release the text buffer. Calling StopSpeechAt with whereToStop equal to kImmediate is equivalent to calling StopSpeech, described on page 14.
Contrast the StopSpeechAt routine with PauseSpeechAt, described next.
PauseSpeechAt
The PauseSpeechAt routine causes speech to pause at a specified point in the text being spoken.
pascal OSErr PauseSpeechAt (SpeechChannel chan,
long whereToPause);
enum {
    kImmediate     = 0,
    kEndOfWord     = 1,
    kEndOfSentence = 2
};
Field descriptions
chan Specific speech channel
whereToPause Location in text at which speech is to pause
DESCRIPTION
PauseSpeechAt makes speech production pause at a specified point in the text. The whereToPause parameter should be set to one of these constants:
■ The kImmediate constant stops speech output immediately.
■ The kEndOfWord constant lets speech continue until the current word has been spoken.
■ The kEndOfSentence constant lets speech continue until the end of the current sentence has been reached.
When the specified point is reached, the speech channel enters the paused state, reflected in the channel’s status. PauseSpeechAt returns immediately, although speech output will continue until the specified point.
If the end of the input text buffer is reached before the specified pause point, speech output pauses at the end of the buffer.
PauseSpeechAt differs from StopSpeech and StopSpeechAt in that a subsequent call to ContinueSpeech, described next, causes the contents of the current text buffer to continue being spoken.
WARNING
While in a paused state, the last text buffer must remain available at all times and must not move. While paused, the SpeechChannel status indicates outputBusy = true and outputPaused = true.
ContinueSpeech
The ContinueSpeech routine resumes speech after it has been halted by the PauseSpeechAt routine.
pascal OSErr ContinueSpeech (SpeechChannel chan);
Field descriptions
chan Specific speech channel
DESCRIPTION
At any time after PauseSpeechAt is called, ContinueSpeech may be called to continue speaking from the point at which speech paused. Calling ContinueSpeech on a channel that is not currently in a paused state has no effect; calling it before a pause has taken effect cancels the pause.
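A pause-and-resume sequence might look like the following sketch; WaitForUserToContinue is a hypothetical application routine:

OSErr PauseUntilUserContinues (SpeechChannel chan) {
    OSErr err;

    err = PauseSpeechAt(chan, kEndOfWord);  // pause at next word boundary
    if (err == noErr) {
        WaitForUserToContinue();     // hypothetical application routine
        err = ContinueSpeech(chan);  // resume where speech paused
    }
    return err;
}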
SpeakBuffer
The SpeakBuffer routine speaks a buffer of text, using control flags to customize its behavior.

pascal OSErr SpeakBuffer (SpeechChannel chan, Ptr textBuf, long byteLength, long controlFlags);

enum {
    kNoEndingProsody    = 1,
    kNoSpeechInterrupt  = 2,
    kPreflightThenPause = 4
};

Field descriptions
chan          Specific speech channel
textBuf       Buffer of text
byteLength    Length of textBuf
controlFlags  Control flags to customize speech behavior
DESCRIPTION
When the controlFlags parameter is set to 0, SpeakBuffer behaves identically to SpeakText, described on page 13.
The kNoEndingProsody flag bit is used to control whether or not the speech synthesizer automatically applies ending prosody, the speech tone and cadence that normally occur at the end of a statement. Under normal circumstances (for example, when the flag bit is not set), ending prosody is applied to the speech when the end of the textBuf data is reached. This default behavior can be disabled by setting the kNoEndingProsody flag bit.
Some synthesizers do not speak until the kNoEndingProsody flag bit is reset, or they encounter a period in the text, or textBuf is full.
The kNoSpeechInterrupt flag bit is used to control the behavior of SpeakBuffer when called on a speech channel that is still busy. When the flag bit is not set, SpeakBuffer behaves similarly to SpeakString and SpeakText, described earlier in this document. Any speech currently being produced on the specified speech channel is immediately interrupted and then the new text buffer is spoken. When the kNoSpeechInterrupt flag bit is set, however, a request to speak on a channel that is still busy processing a prior text buffer will result in an error. The new buffer is ignored and the error synthNotReady is returned. If the prior text buffer has been fully processed, the new buffer is spoken normally.
The kPreflightThenPause flag bit is used to minimize the latency experienced when attempting to speak. Ordinarily whenever a call to SpeakString, SpeakText, or SpeakBuffer is made, the speech synthesizer must perform a certain amount of initial processing before speech output is heard. This startup latency can vary from a few milliseconds to several seconds depending upon which speech synthesizer is being used. Recognizing that larger startup delays may be detrimental to certain applications, a mechanism is provided to give the synthesizer a chance to perform any necessary computations at noncritical times. Once the computations have been completed, the speech is able to start instantly. When the kPreflightThenPause flag bit is set, the speech synthesizer will process the input text as necessary to the point where it is ready to begin producing speech output. At this point, the synthesizer will enter a paused state and return to the caller. When the application is ready to produce speech, it should call the ContinueSpeech routine to begin speaking.
RESULT CODES
noErr              0    No error
synthNotReady   –242    Speech channel is still busy speaking
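For example, an application could preflight a buffer during idle time and later start the speech instantly. This sketch assumes the text buffer remains locked and unmoved throughout:

// Preflight now; no audio is produced until ContinueSpeech is called.
OSErr PreflightUtterance (SpeechChannel chan, Ptr textBuf, long byteLength) {
    return SpeakBuffer(chan, textBuf, byteLength, kPreflightThenPause);
}

// Later, when the speech should actually begin:
OSErr BeginPreflightedUtterance (SpeechChannel chan) {
    return ContinueSpeech(chan);
}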
SpeechBusySystemWide
You can use SpeechBusySystemWide to determine if any speech is currently being synthesized in your application or elsewhere on the computer.
pascal short SpeechBusySystemWide (void);
DESCRIPTION
This routine is useful when you want to ensure that no speech is currently being produced anywhere on the Macintosh computer. SpeechBusySystemWide returns the total number of speech channels currently synthesizing speech on the computer, whether they were initiated by your code or by some other process executing concurrently.
RESULT CODES
None
Converting Text Into Phonemes
In some situations it is desirable to convert a text string into its equivalent phonemic representation. This may be useful during the content development process to fine-tune the pronunciation of particular words or phrases. By first converting the target phrase into phonemes, you can see what the synthesizer will try to speak. Then you need only correct the parts that would not have been spoken the way you want.
TextToPhonemes
The TextToPhonemes routine converts a designated text to phoneme codes.
pascal OSErr TextToPhonemes (SpeechChannel chan, Ptr textBuf, long textBytes, Handle phonemeBuf, long *phonemeBytes);

Field descriptions
chan           Specific speech channel
textBuf        Buffer of text to be converted
textBytes      Length of textBuf in bytes
phonemeBuf     Handle into which the converted phonemes are written
*phonemeBytes  Pointer to length of phonemeBuf in bytes
DESCRIPTION
It may be useful to convert your text into phonemes during application development in order to be able to reduce the amount of memory required to speak. If your application does not require the text-to-phoneme conversion portion of the speech synthesizer, significantly less RAM may be required to speak with some synthesizers. Additionally, you may be able to use a higher quality text-to-phoneme conversion process (even one that does not work in real time) to generate precise phonemic information. This data can then be used with any speech synthesizer to produce better speech.
TextToPhonemes accepts a valid SpeechChannel parameter, a pointer to the characters to be converted into phonemes, the length of the input text buffer in bytes, an application-supplied handle into which the converted phonemes can be written, and a length parameter. On return, the phonemeBytes argument is set to the number of phoneme character bytes that were written into phonemeBuf. The data returned by TextToPhonemes will correspond precisely to the phonemes that would be spoken had the input text been sent to SpeakText instead. All current mode settings are applied to the converted speech. No callbacks are generated while the TextToPhonemes routine is generating its output.
RESULT CODES
noErr                   0    No error
paramErr              –50    Parameter value is invalid
nilHandleErr         –109    Handle argument is nil
siUnknownInfoType    –231    Feature not implemented on synthesizer
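A conversion call might look like the sketch below. It assumes, per the handle-based interface, that the Speech Manager resizes the application-supplied handle to fit the output:

OSErr PrintPhonemes (SpeechChannel chan, Ptr textBuf, long textBytes) {
    Handle phonemeBuf;
    long phonemeBytes;
    OSErr err;

    phonemeBuf = NewHandle(0);
    if (phonemeBuf == NULL)
        return MemError();
    err = TextToPhonemes(chan, textBuf, textBytes, phonemeBuf, &phonemeBytes);
    if (err == noErr) {
        // ... examine phonemeBytes bytes of phonemic text in *phonemeBuf ...
    }
    DisposeHandle(phonemeBuf);
    return err;
}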
Getting Information About a Speech Channel
Several additional types of information are available for advanced users of the Speech Manager. This information provides more detailed status information for each channel. You can get this information by calling the GetSpeechInfo routine. This function accepts selectors that determine the type of information you want to get.
Note
Throughout this document, there are several references to parameter values specified with fixed-point integer values (pbas, pmod, rate, and volm). Unless otherwise stated, the full range of values of the Fixed data type is valid. However, it is left to the individual speech synthesizer implementation to determine whether or not to use the full resolution and range of the Fixed data type. In the event a specified parameter value lies outside the range supported by a particular synthesizer, the synthesizer will substitute the value closest to the specified value that does lie within its performance specifications.
GetSpeechInfo
The GetSpeechInfo routine returns information about a designated speech channel.
pascal OSErr GetSpeechInfo (SpeechChannel chan, OSType selector, void *speechInfo);

enum {
    soStatus         = 'stat',  // gets speech status info
    soErrors         = 'erro',  // gets error status info
    soInputMode      = 'inpt',  // gets current text/phon mode
    soCharacterMode  = 'char',  // gets current character mode
    soNumberMode     = 'nmbr',  // gets current number mode
    soRate           = 'rate',  // gets current speaking rate
    soPitchBase      = 'pbas',  // gets current baseline pitch
    soPitchMod       = 'pmod',  // gets current pitch modulation
    soVolume         = 'volm',  // gets current speaking volume
    soSynthType      = 'vers',  // gets speech synth version info
    soRecentSync     = 'sync',  // gets most recent sync message info
    soPhonemeSymbols = 'phsy',  // gets phoneme symbols & example words
    soSynthExtension = 'xtnd'   // gets synthesizer-specific info
};
Field descriptions
chan Specific speech channel
selector Used to specify data being requested
*speechInfo Pointer to an information structure
DESCRIPTION
The following list of selectors describes the various types of information that can be obtained from the Speech Manager. The format of the information returned depends on which value is used in the selector field, as follows:
Note
For future code compatibility, use the application programming interface (API) labels instead of literal selector values.
Field descriptions
stat Gets various items of status information for the specified channel. Indicates whether any speech audio is being generated, whether or not the channel has paused, how many bytes in the input text have yet to be processed, and the phoneme code for the phoneme that is currently being generated. If inputBytesLeft is 0, the input buffer is no longer needed and can be disposed of. The API label for this selector is soStatus.
typedef SpeechStatusInfo *speechInfo;
typedef struct SpeechStatusInfo {
    Boolean outputBusy;      // true = audio playing
    Boolean outputPaused;    // true = channel paused
    long    inputBytesLeft;  // bytes left to process
    short   phonemeCode;     // current phoneme code
} SpeechStatusInfo;
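A sketch of polling this status, for example to learn when a channel has finished speaking or its input buffer may be released:

Boolean ChannelIsSpeaking (SpeechChannel chan) {
    SpeechStatusInfo status;

    if (GetSpeechInfo(chan, soStatus, &status) != noErr)
        return FALSE;
    return status.outputBusy;  // true while audio is being generated
}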
erro Gets saved error information and clears the error registers. This selector lets you poll for various run-time errors that occur during speaking, such as the detection of badly formed embedded commands. Errors returned directly by Speech Manager routines are not reported here. The count field shows how many errors have occurred since the last check. If count is 0 or 1, then oldest and newest will be the same. Otherwise, oldest contains the error code for the oldest unread error and newest contains the error code for the most recent error. Both oldPos and newPos contain the character positions of their respective errors in the original input text buffer. The API label for this selector is soErrors.
typedef SpeechErrorInfo *speechInfo;
typedef struct SpeechErrorInfo {
    short count;   // # of errs since last check
    OSErr oldest;  // oldest unread error
    long  oldPos;  // char position of oldest err
    OSErr newest;  // most recent error
    long  newPos;  // char position of newest err
} SpeechErrorInfo;
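Polling for embedded-command errors might look like this sketch; ReportSpeechError is a hypothetical application routine:

void PollSpeechErrors (SpeechChannel chan) {
    SpeechErrorInfo errInfo;

    if ((GetSpeechInfo(chan, soErrors, &errInfo) == noErr) &&
            (errInfo.count > 0))
        ReportSpeechError(errInfo.newest, errInfo.newPos);  // hypothetical
}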
inpt Gets the current value of the text processing mode control. The returned value specifies whether the specified speech channel is currently in text-input mode (TEXT) or phoneme-input mode (PHON). The API label for this selector is soInputMode.
typedef OSType *speechInfo; // TEXT or PHON
char Gets the current value of the character processing mode control. The returned value specifies whether the specified speech channel is currently processing input characters in normal mode (NORM) or in literal, letter-by-letter, mode (LTRL). The API label for this selector is soCharacterMode.
typedef OSType *speechInfo; // NORM or LTRL
nmbr Gets the current value of the number processing mode control. The returned value specifies whether the specified speech channel is currently processing input character digits in normal mode (NORM) or in literal, digit-by-digit, mode (LTRL). The API label for this selector is soNumberMode.
typedef OSType *speechInfo; // NORM or LTRL
rate Gets the current speaking rate in words per minute on the specified channel. Speaking rates are fixed-point values. The API label for this selector is soRate.
typedef Fixed *speechInfo;
Note
Words per minute is a convenient, if difficult to define, way of representing speaking rate. Although there is no universally accepted definition of words per minute, it does communicate approximate information about speaking rates. Any specific rate may correspond to different rates on different synthesizers, but the two rates should be reasonably close. More importantly, doubling the rate on a particular synthesizer should halve the time needed to speak any particular utterance.
pbas Gets the current baseline pitch for the specified channel. The pitch value is a fixed-point integer that conforms to the following frequency relationship:
Hertz = 440.0 * 2^((BasePitch - 69) / 12)
BasePitch of 1.0 ≈ 9 Hertz
BasePitch of 39.5 ≈ 80 Hertz
BasePitch of 45.8 ≈ 115 Hertz
BasePitch of 50.4 ≈ 150 Hertz
BasePitch of 100.0 ≈ 2637 Hertz
BasePitch values are always positive numbers in the range from 1.0 through 100.0. The API label for this selector is soPitchBase.
typedef Fixed *speechInfo;
pmod Gets the current pitch modulation range for the speech channel. Modulation values range from 0.0 through 100.0. A value of 0.0 corresponds to no modulation and means the channel will speak in a monotone. The API label for this selector is soPitchMod.
Nonzero modulation values correspond to pitch and frequency deviations according to the following formula:
Maximum pitch = BasePitch + PitchMod
Minimum pitch = BasePitch - PitchMod
Maximum Hertz = BaseHertz * 2^(+ModValue / 12)
Minimum Hertz = BaseHertz * 2^(-ModValue / 12)
Given:
BasePitch of 46.0 (≈ 115 Hertz),
PitchMod of 2.0,
Then:
Maximum pitch = 48.0 (≈131 Hertz),
Minimum pitch = 44.0 (≈104 Hertz)
typedef Fixed *speechInfo;
volm Gets the current setting of the volume control on the specified channel. Volumes are expressed in fixed-point units ranging from 0.0 through 1.0. A value of 0.0 corresponds to silence, and a value of 1.0 corresponds to the maximum possible volume. Volume units lie on a scale that is linear with amplitude or voltage. A doubling of perceived loudness corresponds to a doubling of the volume. The API label for this selector is soVolume.
typedef Fixed *speechInfo;
vers Gets descriptive information for the type of speech synthesizer being used on the specified speech channel. The API label for this selector is soSynthType.
typedef SpeechVersionInfo *speechInfo;
typedef struct SpeechVersionInfo {
    OSType     synthType;          // always 'ttsc'
    OSType     synthSubType;       // synth flavor
    OSType     synthManufacturer;  // synth creator
    long       synthFlags;         // reserved
    NumVersion synthVersion;       // synth version
} SpeechVersionInfo;
sync Returns the sync message code for the most recently encountered embedded sync command at the audio output point. If no sync command has been encountered, 0 is returned. Refer to the section “Embedded Speech Commands,” later in this document, for information about sync commands. The API label for this selector is soRecentSync.
typedef OSType *speechInfo;
phsy Returns a list of phoneme symbols and example words defined for the current synthesizer. The input parameter is the address of a handle variable. On return, this handle refers to the array of phoneme definitions (a PhonemeDescriptor structure). Make sure to dispose of the handle when you are done using it. This information is normally used to indicate to the user the approximate sounds corresponding to various phonemes, an important feature in international speech. The API label for this selector is soPhonemeSymbols.
typedef PhonemeDescriptor ***speechInfo;  // VAR Handle
typedef struct PhonemeInfo {
    short opcode;        // opcode for the phoneme
    Str15 phStr;         // corresponding character string
    Str31 exampleStr;    // word that shows use of phoneme
    short hiliteStart;   // part of example word to be
    short hiliteEnd;     // highlighted as in TextEdit selections
} PhonemeInfo;

typedef struct PhonemeDescriptor {
    short       phonemeCount;    // # of elements
    PhonemeInfo thePhonemes[1];  // element list
} PhonemeDescriptor;
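Because the returned handle belongs to the application, a caller should dispose of it when finished, as in this sketch (DisplayPhoneme is a hypothetical application routine):

void ShowPhonemeList (SpeechChannel chan) {
    PhonemeDescriptor **pd;
    short i;

    if (GetSpeechInfo(chan, soPhonemeSymbols, &pd) == noErr) {
        for (i = 0; i < (*pd)->phonemeCount; i++)
            DisplayPhoneme(&(*pd)->thePhonemes[i]);  // hypothetical routine
        DisposeHandle((Handle)pd);  // the application owns the handle
    }
}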
xtnd This call supports a general method for extending the functionality of the Speech Manager. It is used to get synthesizer-specific information. The format of the returned data is determined by the specific synthesizer queried. The speechInfo argument should be a pointer to the proper data structure. If a particular synthCreator value is not recognized by the synthesizer, the command is ignored and the siUnknownInfoType code is returned. The API label for this selector is soSynthExtension.
typedef SpeechXtndData *speechInfo;
typedef struct SpeechXtndData {
    OSType synthCreator;  // synth creator ID
    Byte   synthData[2];  // data TBD by synth
} SpeechXtndData;
RESULT CODES
noErr                   0    No error
siUnknownInfoType    –231    Feature is not implemented on synthesizer
The Speech Manager provides numerous control features for sophisticated developers. These controls enable you to set various speaking parameters programmatically and provide a rich set of callback routines that can be used to notify applications of various conditions within the speaking process. Many speech synthesizers extend these controls further.
These controls are accessed with the SetSpeechInfo routine. All calls to this routine expect a SpeechChannel parameter, a selector to indicate the desired function, and a pointer to some data. The format of this data depends on the particular selector and is documented in the following routine description.
SetSpeechInfo
The SetSpeechInfo routine sets information for a designated speech channel.
pascal OSErr SetSpeechInfo (SpeechChannel chan, OSType selector, void *speechInfo);

enum {
    soInputMode          = 'inpt',  // text/phoneme input mode
    soCharacterMode      = 'char',  // character processing mode
    soNumberMode         = 'nmbr',  // number processing mode
    soRate               = 'rate',  // speaking rate
    soPitchBase          = 'pbas',  // baseline pitch
    soPitchMod           = 'pmod',  // pitch modulation
    soVolume             = 'volm',  // speaking volume
    soCurrentVoice       = 'cvox',  // current voice
    soCommandDelimiter   = 'dlim',  // embedded command delimiters
    soReset              = 'rset',  // reset channel to defaults
    soCurrentA5          = 'myA5',  // A5 setup for callbacks
    soRefCon             = 'refc',  // reference constant for callbacks
    soTextDoneCallBack   = 'tdcb',  // text-done callback
    soSpeechDoneCallBack = 'sdcb',  // end-of-speech callback
    soSyncCallBack       = 'sycb',  // sync command callback
    soErrorCallBack      = 'ercb',  // error callback
    soPhonemeCallBack    = 'phcb',  // phoneme callback
    soWordCallBack       = 'wdcb',  // word callback
    soSynthExtension     = 'xtnd'   // synthesizer-specific info
};
Field descriptions
chan Specific speech channel
selector Used to specify the data being set
*speechInfo Pointer to an information structure
DESCRIPTION
The following list of selectors outlines the controls available with the Speech Manager. The format of the information passed depends on which value is used in the selector field, as follows:
Note
The Speech Manager supports several callback features that can provide the sophisticated developer with a tight coupling to the speech synthesis process. However, these callbacks must be used carefully. Each is invoked from interrupt level. This means that you may not perform any operations that might cause memory to be allocated, purged, or moved. Although application global variables are also ordinarily not accessible at interrupt time, the soCurrentA5 ('myA5') selector described in the following text can be used to ask the Speech Manager to point register A5 at your application’s global variables prior to each callback. This makes it fairly painless to access global variables from your callback handlers. If this information worries you, don’t despair. Most information available through callbacks is also available through a GetSpeechInfo call. These calls are more friendly and do not come with the constraints imposed upon callback code. The only drawback is that if you do not poll the information you are interested in often enough, you may miss some of the changes in your speech channel’s status.
Field descriptions
inpt Sets the current value of the text processing mode control. The passed value specifies whether the speech channel should be in text-input mode (TEXT) or phoneme-input mode (PHON). Input mode changes take effect as soon as possible; however, the precise latency is dependent upon the specific speech synthesizer. The API label for this selector is soInputMode.
typedef OSType *speechInfo; // TEXT or PHON
char Sets the current value of the character processing mode control. The passed value specifies whether the speech channel should be in normal character processing mode (NORM) or literal, letter-by-letter, mode (LTRL). Character mode changes take effect as soon as possible; however, the precise latency is dependent upon the specific speech synthesizer. The API label for this selector is soCharacterMode.
typedef OSType *speechInfo; // NORM or LTRL
nmbr Sets the current value of the number processing mode control. The passed value specifies whether the specified speech channel should be in normal number processing mode (NORM) or in literal, digit-by-digit, mode (LTRL). The number mode changes take effect as soon as possible. However, the precise latency is dependent upon the specific speech synthesizer. The API label for this selector is soNumberMode.
typedef OSType *speechInfo; // NORM or LTRL
rate Sets the speaking rate in words per minute on the specified channel. Speaking rates are fixed-point values. All values are valid; however, specific synthesizers will not necessarily be able to speak at all possible rates. The API label for this selector is soRate.
typedef Fixed *speechInfo;
pbas Changes the current baseline pitch for the specified channel. The pitch value is a fixed-point integer that conforms to the following frequency relationship:
Hertz = 440.0 * 2^((BasePitch - 69) / 12)
BasePitch of 1.0 ≈ 9 Hertz
BasePitch of 39.5 ≈ 80 Hertz
BasePitch of 45.8 ≈ 115 Hertz
BasePitch of 50.4 ≈ 150 Hertz
BasePitch of 100.0 ≈ 2637 Hertz
BasePitch values are always positive numbers in the range from 1.0 through 100.0.
typedef Fixed *speechInfo;
The API label for this selector is soPitchBase.
pmod Changes the current pitch modulation range for the speech channel. Modulation values range from 0.0 through 100.0. A value of 0.0 corresponds to no modulation and means the channel will speak in a monotone. Nonzero modulation values correspond to pitch and frequency deviations according to the following formula:
Maximum pitch = BasePitch + PitchMod
Minimum pitch = BasePitch - PitchMod
Maximum Hertz = BaseHertz * 2^(+ModValue / 12)
Minimum Hertz = BaseHertz * 2^(-ModValue / 12)
Given:
BasePitch of 46.0 (≈115 Hertz),
PitchMod of 2.0,
Then:
Maximum pitch = 48.0 (≈131 Hertz),
Minimum pitch = 44.0 (≈104 Hertz)
typedef Fixed *speechInfo;
The API label for this selector is soPitchMod.
volm Changes the current speaking volume on the specified channel. Volumes are expressed in fixed-point units ranging from 0.0 through 1.0. A value of 0.0 corresponds to silence, and a value of 1.0 corresponds to the maximum possible volume. Volume units lie on a scale that is linear with amplitude or voltage. A doubling of perceived loudness corresponds to a doubling of the volume. The API label for this selector is soVolume.
typedef Fixed *speechInfo;
cvox Changes the current voice on the current speech channel to the specified voice. Note that this control call will return an incompatibleVoice error if the specified voice is incompatible with the speech synthesizer associated with the speech channel. The API label for this selector is soCurrentVoice.
typedef VoiceSpec *speechInfo;
dlim Sets the delimiter character strings for embedded commands. The start of an embedded command is determined by comparing the input characters to the start-command delimiter string. Likewise, the end of a command is determined by comparing the input characters to the end-command delimiter string. Command delimiter strings are either 1 or 2 bytes in length. If a single byte delimiter is desired, it should be followed by a null (0) byte. Delimiter characters must come from the set of printable characters. If the delimiter strings are empty, this will have the effect of disabling embedded command processing. Care must be taken not to choose delimiter strings that might occur naturally in the text to be spoken. The API label for this selector is soCommandDelimiter.
typedef DelimiterInfo *speechInfo;
typedef struct DelimiterInfo {
    Byte startDelimiter[2];  // defaults to "[["
    Byte endDelimiter[2];    // defaults to "]]"
} DelimiterInfo;
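Switching to hypothetical angle-bracket delimiters, for instance, might look like this sketch:

OSErr UseAngleBracketDelimiters (SpeechChannel chan) {
    DelimiterInfo delims;

    delims.startDelimiter[0] = '<';  // illustrative choice only; pick
    delims.startDelimiter[1] = '<';  // strings unlikely to occur in
    delims.endDelimiter[0]   = '>';  // the text to be spoken
    delims.endDelimiter[1]   = '>';
    return SetSpeechInfo(chan, soCommandDelimiter, &delims);
}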
rset Resets the speech channel to its default states. The speechInfo parameter should be set to 0. Specific synthesizers may provide other reset capabilities. The API label for this selector is soReset.
typedef long *speechInfo;
myA5 An application uses this selector to request that the speech synthesizer set up an A5 world prior to all callbacks. In order for an application to access any of its global data, it is necessary that register A5 contain the correct value, since all global variables are referenced relative to register A5. If you pass a non-null value in the speechInfo parameter, the speech synthesizer will set register A5 to this value just before it calls one of your callback routines. The A5 register is restored to its original value when your callback routine returns. The API label for this selector is soCurrentA5.
typedef Ptr speechInfo;
A typical application would make the call to SetSpeechInfo with code like the following:
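A sketch of such a call, assuming chan is an open speech channel and using the Toolbox routine SetCurrentA5 to capture the application's A5 at non-interrupt time:

OSErr err;

// Pass the application's A5 so callbacks can reach global variables
err = SetSpeechInfo(chan, soCurrentA5, (Ptr)SetCurrentA5());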
refc Sets the reference constant associated with the specified channel. All callbacks generated for this channel will return this reference constant for use by the application. The application can use this value any way it wants to. The API label for this selector is soRefCon.
typedef long *speechInfo;
tdcb Enables the callback that signals that text input processing is done. Your callback routine is invoked when the current buffer of input text has been processed and is no longer needed by the speech synthesizer. This callback does not indicate that the synthesizer is finished speaking the text (see the sdcb callback description, next), merely that the input text has been fully processed and is no longer needed by the speech synthesizer. This callback can be disabled by passing a null ProcPtr in the speechInfo parameter. When your callback routine is invoked, you have two options. If you set the nextBuf, byteLen, and controlFlags variables before returning, you will enable the speech synthesizer to continue speaking without any interruption in the output. If you set the nextBuf parameter to null, you are indicating that you have no more text to speak. The controlFlags parameter is defined as in SpeakBuffer. The API label for this selector is soTextDoneCallBack.
typedef Ptr speechInfo;
pascal void MyInputDoneCallback (SpeechChannel chan, long refCon, Ptr *nextBuf, long *byteLen, long *controlFlags);
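A minimal text-done callback that supplies no further text and flags the input buffer as disposable might look like this sketch; gInputDone is a hypothetical application global (accessible at interrupt time only if soCurrentA5 has been set up):

static Boolean gInputDone;  // hypothetical application global

pascal void MyInputDoneCallback (SpeechChannel chan, long refCon,
        Ptr *nextBuf, long *byteLen, long *controlFlags) {
    *nextBuf = NULL;    // no more text to speak
    gInputDone = TRUE;  // main loop may now release the input buffer
}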
sdcb Enables an end-of-speech callback. Your callback routine is called whenever an input text stream has been completely processed and spoken. When your callback routine is invoked, you can be certain that the speech channel is now idle and no audio is being generated. This callback can be disabled by passing a null ProcPtr in the speechInfo parameter. The API label for this selector is soSpeechDoneCallBack.
typedef Ptr speechInfo;
pascal void MyEndOfSpeechCallback (SpeechChannel chan, long refCon);
sycb Enables the sync command callback. Your callback routine is invoked when the text following a sync embedded command is about to be spoken. This callback can be disabled by passing a null ProcPtr in the speechInfo parameter. See “Embedded Speech Commands,” later in this document, for a description of how to use sync commands. The API label for this selector is soSyncCallBack.
typedef Ptr speechInfo;
pascal void MySyncCommandCallback (SpeechChannel chan,
        long refCon, OSType syncMessage);
ercb Enables error callbacks. Your callback routine is called whenever an error occurs during the processing of an input text stream. Errors can result from syntax problems in the input text, insufficient CPU processing speed (such as an audio data underrun), or other conditions that may arise during the speech conversion process. If error callbacks have not been enabled, the Speech Manager saves the value of each error condition it detects; the error codes can then be read using the GetSpeechInfo status selector soErrors (erro). The error callback can be disabled by passing a null ProcPtr in the speechInfo parameter. The API label for this selector is soErrorCallBack.
typedef Ptr speechInfo;
pascal void MyErrorCallback (SpeechChannel chan,
long refCon, OSErr error, long bytePos);
phcb Enables phoneme callbacks. Your callback routine is invoked for each phoneme generated by the speech synthesizer just before the phoneme is actually spoken. This callback can be disabled by passing a null ProcPtr in the speechInfo parameter. The API label for this selector is soPhonemeCallBack.
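By analogy with the other callback selectors, the declarations presumably take the following form (the phoneme opcodes are those listed in Table 1-2):

typedef Ptr speechInfo;
pascal void MyPhonemeCallback (SpeechChannel chan,
        long refCon, short phonemeOpcode);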
wdcb Enables word callbacks. Your callback routine is invoked for each word generated by the speech synthesizer just before the word is actually spoken. This callback can be disabled by passing a null ProcPtr in the speechInfo parameter. The API label for this selector is soWordCallBack.
typedef Ptr speechInfo;
pascal void MyWordCallback (SpeechChannel chan,
long refCon, long wordPos, short wordLen);
xtnd This call supports a general method for extending the functionality of the Speech Manager. It is used to set synthesizer-specific information. The speechInfo argument should be a pointer to the appropriate data structure. If a particular synthCreator value is not recognized by the synthesizer, the command is ignored and an siUnknownInfoType code is returned. The API label for this selector is soSynthExtension.
typedef SpeechXtndData *speechInfo;
typedef struct SpeechXtndData {
OSType synthCreator; // synth creator ID
Byte synthData[2]; // data TBD by synth
} SpeechXtndData;
RESULT CODES
noErr                  0    No error
paramErr             –50    Parameter value is invalid
siUnknownInfoType   –231    Feature is not implemented on synthesizer
incompatibleVoice   –245    Specified voice cannot be used with synthesizer
Pronunciation Dictionaries

No matter how sophisticated a speech synthesis system is, there will always be words that it does not automatically pronounce correctly. The clearest examples of frequently mispronounced words are proper names (names of people, places, and so on).
One way to get around this fundamental limitation is to use a dictionary of pronunciations. Whenever a speech synthesizer needs to determine the proper phonemic representation for a particular word, it first looks for the word in its dictionaries. Pronunciation dictionary entries contain information that enables precise conversion between text and the correct phoneme codes. They also provide stress, intonation, and other information to help speech synthesizers produce more natural speech. If the word in question is found in the dictionary, then the synthesizer uses the information from the dictionary entry rather than relying on its own letter-to-sound rules. The use of phonemes is described in “Summary of Phonemes and Prosodic Controls,” later in this document.
The Speech Manager word storage format provides high-quality data that is interchangeable between speech synthesizers. The Speech Manager also uses an easily extensible dictionary structure that does not affect the usability of existing dictionaries.
It is assumed that application-defined pronunciation dictionaries will reside in RAM when in use. The run-time structure of dictionary data presumably depends on the specific needs of particular speech synthesizers and will therefore differ from the structure of the dictionaries as stored on disk.
Associating a Dictionary With a Speech Channel
The following routines can be used to associate an application-defined pronunciation dictionary with a particular speech channel.
UseDictionary
The UseDictionary routine associates a designated dictionary with a specific speech channel.
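Its declaration takes the target speech channel and a handle to the dictionary data:

pascal OSErr UseDictionary (SpeechChannel chan, Handle dictionary);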
The speech synthesizer will attempt to use the dictionary data pointed to by the dictionary handle argument to augment the built-in pronunciation rules on the specified speech channel. The synthesizer will use whatever elements of the dictionary resource it considers useful to the speech conversion process. After returning from UseDictionary, the caller is free to release any storage allocated for the dictionary handle. The search order for application-provided dictionaries is last in, first searched.
All details of how an application-provided dictionary is represented within the speech synthesizer are dependent on the specific synthesizer implementation and are totally private to the synthesizer.
RESULT CODES
noErr             0    No error
memFullErr     –108    Not enough memory to use new dictionary
badDictFormat  –246    Format problem with pronunciation dictionary
Each application-defined pronunciation dictionary is implemented as a single resource of type 'dict'. To read the dictionary contents, the system first reads the resource into memory using Resource Manager routines.
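For example, an application might install a dictionary as follows (a minimal sketch; kMyDictionaryID is a hypothetical resource ID, and mySpeechChannel is assumed to be a valid speech channel):

Handle  dict;
OSErr   err;

dict = GetResource ('dict', kMyDictionaryID);    // read dictionary resource into memory
if (dict != NULL) {
    err = UseDictionary (mySpeechChannel, dict); // synthesizer copies what it needs
    ReleaseResource (dict);                      // storage may be released after the call
}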
An application dictionary contains the following information:
total byte length (long) (Length is all-inclusive)
atom type (long)
format version (long)
script code (short)
language code (short)
region code (short)
date last modified (long) (Seconds since January 1, 1904)
reserved(4) (long)
entry count (long)
list of entries
The currently defined atom type is
'dict'  →  Dictionary
Each entry consists of the following:
entry byte length (short) (Length is all-inclusive)
entry type (short)
field count (short)
list of fields
The currently defined entry types are the following:
0x00          →  Null entry
0x01 to 0x20  →  Reserved
0x21          →  Pronunciation entry
0x22          →  Abbreviation entry
Each field consists of the following:
field byte length (short) (Length is all-inclusive minus padding)
field type (short)
field data (char[]) (Data is padded to word boundary)
The currently defined field types are the following:
0x00          →  Null field
0x01 to 0x20  →  Reserved
0x21          →  Word represented in textual format
0x22          →  Phonemic pronunciation, including a complete set of syllable, lexical stress, word prominence, and prosodic markers, represented in textual format
0x23          →  Part-of-speech code
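For illustration only, the layout described above could be expressed as the following C structures (these type and field names are hypothetical; the Speech Manager interfaces define no such types for applications, and all multibyte fields are stored big-endian, as is standard for Macintosh resources):

typedef struct {                    // dictionary header
    long    totalByteLength;        // all-inclusive length
    OSType  atomType;               // 'dict'
    long    formatVersion;
    short   scriptCode;
    short   languageCode;
    short   regionCode;
    long    dateLastModified;       // seconds since January 1, 1904
    long    reserved[4];
    long    entryCount;
                                    // variable-length list of entries follows
} DictHeader;

typedef struct {                    // entry header
    short   entryByteLength;        // all-inclusive length
    short   entryType;              // 0x21 = pronunciation, 0x22 = abbreviation
    short   fieldCount;
                                    // variable-length list of fields follows
} DictEntry;

typedef struct {                    // field header
    short   fieldByteLength;        // all-inclusive length minus padding
    short   fieldType;              // 0x21 = word text, 0x22 = pronunciation,
                                    // 0x23 = part-of-speech code
    char    fieldData[1];           // data padded to a word boundary
} DictField;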
Creating and Editing Dictionaries
There is no built-in support for creating and editing speech dictionaries. You can create dictionary resources using any of the available resource editing tools such as the MPW Rez tool or ResEdit. Of course, you can also fairly easily develop routines to edit the dictionary structure from within the application. At the present time, no assumption should be made that the entries in a dictionary are stored in sorted order.
Advanced Voice Information Routines
Ordinarily, an application should need to use only the GetVoiceDescription routine to access information about a particular voice. Occasionally, however, it may be necessary to obtain more detailed information by using the GetVoiceInfo routine.
GetVoiceInfo
The GetVoiceInfo routine returns information about a specified voice beyond that obtainable through the GetVoiceDescription routine.
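Its declaration parallels that of GetVoiceDescription:

pascal OSErr GetVoiceInfo (VoiceSpec *voice, OSType selector, void *voiceInfo);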
typedef struct VoiceFileInfo {
FSSpec fileSpec; // vol, dir, name info for voice file
short resID; // resource ID of voice in the file
} VoiceFileInfo;
enum {
soVoiceDescription = 'info', // gets basic voice info
soVoiceFile = 'fref' // gets voice file ref info
};
Field descriptions
*voice Specifies the voice to be interrogated
selector Used to specify data being requested
*voiceInfo Pointer to an information structure
DESCRIPTION
This function accepts selectors that determine the type of information you want to get. The format of the information returned depends on which value is used in the selector field, as follows:
Field descriptions
info Gets basic information for the specified voice. The structure returned is functionally equivalent to the VoiceDescription data structure in GetVoiceDescription, described earlier in this document. To maximize compatibility with future versions of the Speech Manager, the application must set the length field of the VoiceDescription structure to the size of the existing record before calling GetVoiceInfo, which then returns the size of the new record.
fref Gets file reference information for specified voice; normally only used by speech synthesizers to access voice disk files directly.
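For example, basic information for a voice could be requested with the info selector as follows (a sketch; myVoiceSpec is assumed to identify a valid voice):

VoiceDescription  desc;
OSErr             err;

desc.length = sizeof (VoiceDescription);    // set length field before the call
err = GetVoiceInfo (&myVoiceSpec, soVoiceDescription, &desc);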
RESULT CODES
noErr              0    No error
memFullErr      –108    Not enough memory to load voice into memory
voiceNotFound   –244    Voice resource not found
Embedded Speech Commands
This section describes how you can insert commands directly into the input text to control or modify the spoken output. When processing input text data, speech synthesizers look for special sequences of characters called delimiters. These character sequences are usually defined to be unusual pairings of printable characters that would not normally appear in the text. When a begin command delimiter string is encountered in the text, the following characters are assumed to contain one or more commands. The synthesizer will attempt to parse and process these commands until an end command delimiter string is encountered.
Embedded Speech Command Syntax
By default, the begin command and end command delimiters are defined to be [[ and ]]. The syntax of embedded command blocks is given below, according to these rules:
■ Items enclosed in angle brackets (< and >) represent logical units that are either defined further below or are atomic units that should be self-explanatory.
■ Items enclosed in square brackets ([ and ]) are optional.
■ Items followed by an ellipsis (…) may be repeated one or more times.
■ For items separated by a vertical bar (|), any one of the listed items may be used.
■ Multiple space characters between tokens may be used if desired.
■ Multiple commands should be separated by semicolons.
All other characters that are not enclosed between angle brackets must be entered literally. There is no limit to the number of commands that can be included in a single command block.
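For example, with the default delimiters, input text containing a command block might look like the following (a hypothetical illustration):

The meeting will start [[ emph +; rate 90 ]] immediately.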
The following definitions are used in the command descriptions:
OSType           <4-character pattern (e.g., RATE, vers, aBcD)>
Character        <Any printable character (e.g., A, b, *, #, x)>
FixedPointValue  <Decimal number: 0.0000 ≤ N ≤ 65535.9999>
32BitValue       <OSType> | <LongInt> | <HexLongInt>
16BitValue       <Integer> | <HexInteger>
8BitValue        <Byte> | <HexByte>
LongInt          <Decimal number: 0 ≤ N ≤ 4294967295>
HexLongInt       <Hex number: 0x00000000 ≤ N ≤ 0xFFFFFFFF>
Integer          <Decimal number: 0 ≤ N ≤ 65535>
HexInteger       <Hex number: 0x0000 ≤ N ≤ 0xFFFF>
Byte             <Decimal number: 0 ≤ N ≤ 255>
HexByte          <Hex number: 0x00 ≤ N ≤ 0xFF>
Here is the embedded command syntax structure:
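(The structure shown below is a reconstruction consistent with the rules and definitions above; the nonterminal names are illustrative.)

CommandBlock    ::= <BeginDelimiter> <CommandList> <EndDelimiter>
CommandList     ::= <Command> [; <Command>]…
Command         ::= <CommandSelector> [<Parameter>]…
CommandSelector ::= <OSType>
Parameter       ::= <OSType> | <Character>… | <FixedPointValue>
                    | <32BitValue> | <16BitValue> | <8BitValue>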
Embedded Speech Command Set
Table 1-1 outlines the set of currently defined embedded speech commands.
Table 1-1 Embedded speech commands
Command Selector Command syntax and description
Version vers vers <Version>
Version::= <32BitValue>
This command informs the synthesizer of the format version that will be used in subsequent commands. This command is optional but is highly recommended. The current version is 1.
Delimiter dlim dlim <StartDelimiter> <EndDelimiter>
StartDelimiter ::= <Character> [<Character>]
EndDelimiter ::= <Character> [<Character>]
The delimiter command specifies the character sequences that mark the beginning and end of all subsequent commands. The new delimiters take effect at the end of the current command block. If the delimiter strings are empty, an error is generated. (Contrast this behavior with the dlim function of SetSpeechInfo.)
Comment cmnt cmnt [Character]…
This command enables a developer to insert a comment into a text stream for documentation purposes. Note that all characters following the cmnt selector up to the <EndDelimiter> are part of the comment.
Reset rset rset <32BitValue>
The reset command will reset the speech channel’s settings back to the default values. The parameter should be set to 0.
Baseline pitch pbas pbas [+ | -] <Pitch>
Pitch ::= <FixedPointValue>
The baseline pitch command changes the current pitch for the speech channel. The pitch value is a fixed-point number in the range 1.0 through 100.0 that conforms to the frequency relationship
Hertz = 440.0 × 2^((Pitch − 69) / 12)
If the pitch number is preceded by a + or – character, the baseline pitch is adjusted relative to its current value. Pitch values are always positive numbers. For further details, see “SetSpeechInfo,” earlier in this document.
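For example, a baseline pitch of 60.0 yields 440.0 × 2^((60 − 69) / 12) ≈ 261.6 Hz, which is middle C.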
Pitch modulation pmod pmod [+ | -] <ModulationDepth>
ModulationDepth ::= <FixedPointValue>
The pitch modulation command changes the modulation range for the speech channel. The modulation value is a fixed-point number in the range 0.0 through 100.0 that conforms to the following pitch and frequency relationships:
Maximum pitch = BasePitch + PitchMod
Minimum pitch = BasePitch - PitchMod
Maximum Hertz = BaseHertz × 2^(+ModValue / 12)
Minimum Hertz = BaseHertz × 2^(−ModValue / 12)
A value of 0.0 corresponds to no modulation and will cause the speech channel to speak in a monotone. If the modulation depth number is preceded by a + or – character, the pitch modulation is adjusted relative to its current value. For further details, see “SetSpeechInfo,” earlier in this document.
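For example, with a baseline pitch of 60.0 (about 261.6 Hz) and a modulation of 5.0, the pitch may range from 55.0 to 65.0, or from roughly 196 Hz to 349 Hz.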
Speaking rate rate rate [+ | -] <WordsPerMinute>
WordsPerMinute ::= <FixedPointValue>
The speaking rate command sets the speaking rate in words per minute on the speech channel. If the rate value is preceded by a + or – character, the speaking rate is adjusted relative to its current value.
Volume volm volm [+ | -] <Volume>
Volume::= <FixedPointValue>
The volume command changes the speaking volume on the speech channel. Volumes are expressed in fixed-point units ranging from 0.0 through 1.0. A value of 0.0 corresponds to silence, and a value of 1.0 corresponds to the maximum possible volume. Volume units lie on a scale that is linear with amplitude or voltage, so doubling the volume value doubles the amplitude of the output rather than its perceived loudness.
Sync sync sync <SyncMessage>
SyncMessage::= <32BitValue>
The sync command causes a callback to the application’s sync command callback routine. The callback is made when the audio corresponding to the next word begins to sound. The callback routine is passed the SyncMessage value from the command. If the callback routine has not been defined, the command is ignored. For further details, see “SetSpeechInfo,” earlier in this document.
Input mode inpt inpt TX | TEXT | PH | PHON
This command switches the input processing mode to either normal text mode or raw phoneme mode.
Character mode char char NORM | LTRL
The character mode command sets the word speaking mode of the speech synthesizer. When NORM mode is selected, the synthesizer attempts to automatically convert words into speech. This is the most basic function of the text-to-speech synthesizer. When LTRL mode is selected, the synthesizer speaks every word, number, and symbol letter by letter. Embedded command processing continues to function normally, however.
Number mode nmbr nmbr NORM | LTRL
The number mode command sets the number speaking mode of the speech synthesizer. When NORM mode is selected, the synthesizer attempts to automatically speak numeric strings as intelligently as possible. When LTRL mode is selected, numeric strings are spoken digit by digit.
Silence slnc slnc <Milliseconds>
Milliseconds ::= <32BitValue>
The silence command causes the synthesizer to generate silence for the specified amount of time.
Emphasis emph emph + | -
The emphasis command causes the next word to be spoken with either greater emphasis or less emphasis than would normally be used. Using + will force added emphasis, while using – will force reduced emphasis.
Extension xtnd xtnd <SynthCreator> [<Parameter>]…
SynthCreator ::= <OSType>
The extension command enables synthesizer-specific commands to be embedded in the input text stream. The format of the data following SynthCreator is entirely dependent on the synthesizer being used. If a particular SynthCreator is not recognized by the synthesizer, the command is ignored but no error is generated.
Synthesizers often support embedded commands that extend the set given in Table 1-1.
Embedded Speech Command Error Reporting
While embedded speech commands are being processed, several types of errors may be detected and reported to your application. If you have set up an error callback handler with the soErrorCallBack selector of the SetSpeechInfo routine (described earlier), you will be notified once for every error that is detected. If you have not enabled error callbacks, you can still obtain information about the errors encountered by calling GetSpeechInfo with the soErrors selector (also described earlier). The following errors are detected during processing of embedded speech commands:
badParmVal       –245    Parameter value is invalid
badCmdText –246 Embedded command syntax or parameter problem
unimplCmd –247 Embedded command is not implemented on synthesizer
unimplMsg –248 Raw phoneme text contains invalid characters
badVoiceID –250 Specified voice has not been preloaded
badParmCount –252 Incorrect number of embedded command arguments found
Summary of Phonemes and Prosodic Controls
This section summarizes the phonemes and prosodic controls used by American English speech synthesizers.
Phoneme Set
Table 1-2 summarizes the set of standard phonemes recognized by American English speech synthesizers.
In this description, it is assumed that specific rules and markers apply only to general American English. Other languages and dialects require different phoneme inventories. Phonemes divide into two groups: vowels and consonants. All vowel symbols are uppercase pairs of letters. For consonants, in cases in which the correspondence between the consonant and its symbol is apparent, the symbol is that lowercase consonant; in other cases, the symbol is an uppercase consonant. Within the example words, the individual sounds being exemplified appear in boldface.
Table 1-2 American English phoneme symbols

Symbol  Example        Opcode      Symbol  Example  Opcode
AE      bat              2         b       bin        18
EY      bait             3         C       chin       19
AO      caught           4         d       din        20
AX      about            5         D       them       21
IY      beet             6         f       fin        22
EH      bet              7         g       gain       23
IH      bit              8         h       hat        24
AY      bite             9         J       gin        25
IX      roses           10         k       kin        26
AA      cot             11         l       limb       27
UW      boot            12         m       mat        28
UH      book            13         n       nat        29
UX      bud             14         N       tang       30
OW      boat            15         p       pin        31
AW      bout            16         r       ran        32
OY      boy             17         s       sin        33
%       silence          0         S       shin       34
@       breath intake    1         t       tin        35
                                   T       thin       36
                                   v       van        37
                                   w       wet        38
                                   y       yet        39
                                   z       zen        40
                                   Z       genre      41
Note
The “silence” phoneme (%) and the “breath” phoneme (@) may be lengthened or shortened like any other phoneme.
Prosodic Controls
The symbols listed in Table 1-3 are recognized as modifiers to the basic phonemes described in the preceding section. They can be used to more precisely control the quality of speech that is described in terms of raw phonemes.
Table 1-3 Prosodic control symbols

Syllable breaks:
Syllable mark      = (equal)       Marks syllable breaks within a word, as in
                                   AEn=t2IH=sIX=p1EY=SAXn (“anticipation”)

Word prominence: Marks the beginning of a word (required)
Unstressed         ~ (asciitilde)  Used for words with minimal information content
Normal stress      _ (underscore)  Used for information-bearing words
Emphatic stress    + (plus)        Special emphasis for a word

Prosodic: Placed before the affected phoneme
Pitch rise         / (slash)       Pitch will rise on the following phoneme
Pitch fall         \ (backslash)   Pitch will fall on the following phoneme
Lengthen phoneme   > (greater)     Lengthen the duration of the following phoneme
Shorten phoneme    < (less)        Shorten the duration of the following phoneme
Punctuation: Pitch effect Timing effect
. (period) Sentence final fall Pause follows
? (question) Sentence final rise Pause follows
! (exclam) Sentence final sharp fall Pause follows
… (ellipsis) Clause final level Pause follows
, (comma) Continuation rise Short pause follows
; (semicolon) Continuation rise Short pause follows
: (colon) Clause final level Short pause follows
( (parenleft) Start reduced range Short pause precedes
) (parenright) End reduced range Short pause follows
“ ‘ (quotedblleft, quotesingleleft)     Varies                 Varies
” ’ (quotedblright, quotesingleright)   Varies                 Varies
- (hyphen) Clause-final level Short pause follows
& (ampersand) Forces no addition of silence between phonemes
Specific pitch contours associated with these punctuation marks may vary according to other considerations in the analysis of the text, such as whether a question is rhetorical or begins with a wh- question word, so the effects above should be regarded as guidelines rather than absolutes. The same applies to the timing effects, which vary according to the current rate setting.
The prosodic control symbols (/, \, <, and >) may be concatenated to provide more exaggerated, cumulative effects. The specific nature of the effect is dependent on the speech synthesizer. Speech synthesizers also often extend or enhance the controls described in this section.
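For example, a raw phonemic input string using several of these controls might look like the following (a hypothetical illustration; exact stress-mark placement conventions may vary by synthesizer):

[[ inpt PHON ]] _DIHs _IHz ~AX _t1EHst. [[ inpt TEXT ]]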
Summary of the Speech Manager
Constants
#define gestaltSpeechAttr 'ttsc' // Gestalt Manager selector for speech attributes
enum {
gestaltSpeechMgrPresent = 0 // Gestalt bit that indicates that Speech